You are viewing the RapidMiner Studio documentation for version 10.1.

HTML to XML (Text Processing)

Synopsis

This operator converts an HTML document into an XML/XHTML document.

Description

The HTML to XML operator takes a document in HTML format and parses it into strict XHTML, fixing problems such as unclosed stand-alone tags. This is useful when an XHTML document is required or when the document must be fully valid.
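How the operator performs this clean-up internally is not described here. Purely as an illustration of the kind of repair involved, the following Python sketch uses the standard library's html.parser to lowercase tag names, close an unclosed li, self-close stand-alone tags, and close anything still open at the end. The names XhtmlSketch, html_to_xhtml, and VOID_TAGS are hypothetical, and the rules are a deliberately small assumption, not the operator's actual algorithm.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small, incomplete set of stand-alone (void) tags for illustration.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class XhtmlSketch(HTMLParser):
    """Hypothetical helper: re-emits HTML as well-formed, lowercase markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the output document
        self.open_tags = []  # stack of elements that still need a closing tag

    def _emit_start(self, tag, attrs, self_close):
        parts = [tag]
        for name, value in attrs:
            # Boolean attributes (no value) are written as name="name".
            value = name if value is None else value
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.out.append("<" + " ".join(parts) + (" />" if self_close else ">"))

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased the tag name (<H1> arrives as h1).
        if tag == "li" and self.open_tags and self.open_tags[-1] == "li":
            self.out.append("</li>")  # close the previous, unclosed list item
            self.open_tags.pop()
        if tag in VOID_TAGS:
            self._emit_start(tag, attrs, self_close=True)
        else:
            self._emit_start(tag, attrs, self_close=False)
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Tags already written as <br/> stay self-closed.
        self._emit_start(tag, attrs, self_close=True)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or tag not in self.open_tags:
            return  # stray end tag: drop it
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        # Text arrives unescaped, so re-escape &, < and > for XML.
        self.out.append(escape(data, quote=False))

    def finish(self):
        # Close whatever is still open at the end of the document.
        while self.open_tags:
            self.out.append(f"</{self.open_tags.pop()}>")
        return "".join(self.out)


def html_to_xhtml(html_text):
    parser = XhtmlSketch()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return parser.finish()


print(html_to_xhtml("<H1>Title</H1><ul><li>one<li>two</ul><br>"))
# prints: <h1>Title</h1><ul><li>one</li><li>two</li></ul><br />
```

The sketch drops comments and the doctype and does not add a namespace or validate against a DTD, so it shows only the tag-level repairs, not full XHTML conformance.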

Input

  • document

    The HTML document to be transformed.

Output

  • document

    The resulting XHTML document.

Tutorial Processes

Replace invalid HTML tags

In this example, we first generate an HTML document that contains many tags that do not conform to XHTML, such as an unclosed li, unclosed stand-alone tags, and <H1> instead of <h1>.

We then pass this document to the HTML to XML operator.

When we open the results, we see that the operator has replaced all invalid tags with their valid representations.
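Outside RapidMiner, one way to see the difference between such an input and a cleaned-up result is to hand both to a strict XML parser. The Python sketch below does this with xml.etree.ElementTree. The two strings are hypothetical stand-ins written for this illustration, not the tutorial's actual document or the operator's actual output, and is_well_formed is an assumed helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical input: unclosed <li> elements and a bare <br>.
html_input = (
    "<html><body><H1>Title</H1>"
    "<ul><li>one<li>two</ul><br></body></html>"
)

# Hypothetical cleaned-up counterpart: lowercase tags, every element closed.
xhtml_output = (
    "<html><body><h1>Title</h1>"
    "<ul><li>one</li><li>two</li></ul><br /></body></html>"
)

print(is_well_formed(html_input))    # prints: False
print(is_well_formed(xhtml_output))  # prints: True
```

This checks well-formedness only. Validating against the XHTML DTD would need a validating parser, which the standard library does not provide.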